The University of Edinburgh Art Collection “supports the world-leading research and teaching that happens within the University. Comprised of an astonishing range of objects and ideas spanning two millennia and a multitude of artistic forms, the collection reflects not only the long and rich trajectory of the University, but also major national and international shifts in art history.” (Source: https://collections.ed.ac.uk/art/about)
See the sidebar on the collection page and note that, at the time of writing, there are 2986 pieces in the art collection we’re collecting data on.
In this workshop we’ll scrape data on all art pieces in the Edinburgh College of Art collection.
Before getting started, let’s check that a bot has permissions to access pages on this domain.
library(robotstxt)
paths_allowed("https://collections.ed.ac.uk/art")
## collections.ed.ac.uk
## [1] TRUE
Complete the following steps before you join the live workshop!
You have three tasks you should complete before the workshop:
Task 2: Download and install the SelectorGadget extension for your browser. Once installed, you can open SelectorGadget by clicking its icon next to the address bar in your Chrome or Firefox browser.
Complete the following steps during the live workshop with your team.
As usual, start by cloning your lab repo, named lab-04-uoe-art-YOUR_TEAMNAME. Each team member should clone the repo, and you should take turns working on the various parts of the lab. Note that each team member must make commits to the repository to be eligible for points on this assignment. Remember that whenever a team member takes over, their first action should be to pull from the repo before adding new content.
Today we will be using both R scripts and R Markdown documents:
.R: R scripts are plain text files containing only code and brief comments.
.Rmd: R Markdown documents are plain text files containing a combination of code, its output, and narrative text.
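To make the difference concrete, here is a minimal, hypothetical R Markdown skeleton (the title and chunk label are placeholders, not part of the lab files). Code lives inside the ```{r} chunk delimiters; everything else is narrative text:

````markdown
---
title: "Lab 04 - UoE Art Collection"
output: github_document
---

Narrative text goes here, outside of code chunks.

```{r load-packages}
# code goes inside chunks like this one
library(tidyverse)
```
````

An .R script, by contrast, would contain only the code and comments from inside the chunk.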
Here is the organization of your repo, and the corresponding section in the lab that each file will be used for:
|-data
| |- README.md
|-lab-04-uoe-art.Rmd # analysis
|-lab-04-uoe-art.Rproj
|-README.md
|-scripts # webscraping
| |- 01-scrape-page-one.R # scraping a single page
| |- 02-scrape-page-function.R # functions
| |- 03-scrape-page-many.R # iteration
Tip: To run the code, you can highlight or put your cursor next to the lines of code you want to run and hit Cmd+Enter (macOS) or Ctrl+Enter (Windows/Linux).
Work in scripts/01-scrape-page-one.R.
We will start off by scraping data on the first 10 pieces in the collection from the first page of the collection listing.
First, we define a new object called first_url, which stores the URL of the first page of the collection listing. Then, we read the page at this URL with the read_html() function from the rvest package. The code for this is already provided in 01-scrape-page-one.R.
# load package
library(rvest)
# set url
first_url <- "https://collections.ed.ac.uk/art/search/*:*/Collection:%22edinburgh+college+of+art%7C%7C%7CEdinburgh+College+of+Art%22?offset=0"
# read html page
page <- read_html(first_url)
For the ten pieces on this page we will extract title, artist, and link information, and put these three variables in a data frame.
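Once the three vectors have been extracted, the final assembly step might look like the following sketch. The vector contents here are placeholders standing in for the scraped results, not real data from the collection:

```r
library(tibble)

# hypothetical vectors standing in for the scraped title, artist,
# and link values -- the real ones come from the extraction code
titles  <- c("A Portrait", "Still Life")
artists <- c("Artist One", "Artist Two")
links   <- c("https://collections.ed.ac.uk/art/record/1",
             "https://collections.ed.ac.uk/art/record/2")

# combine into a data frame with one row per piece
first_ten <- tibble(title = titles, artist = artists, link = links)
first_ten
```

tibble() recycles nothing silently and keeps the columns in the order given, which makes it a convenient way to line up parallel vectors of scraped values.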
Let’s start with titles. We make use of the SelectorGadget to identify the tags for the relevant elements:
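As a sketch of what that extraction step can look like, the code below pulls the title text out of the page read in earlier. The CSS selector ".iteminfo h3 a" is a hypothetical stand-in for whatever SelectorGadget actually reports on the page, so treat it as an assumption rather than the lab's answer:

```r
library(rvest)

# page was created earlier with read_html(first_url)
# NOTE: ".iteminfo h3 a" is a hypothetical selector -- substitute
# the one SelectorGadget highlights for the title elements
titles <- page %>%
  html_nodes(".iteminfo h3 a") %>%  # select the title elements
  html_text()                       # extract their text content

titles
```

The same pattern (select elements with a CSS selector, then extract text or an attribute) is what we will repeat for the artist and link variables.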